{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "!pip install autokeras\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "from sklearn.datasets import load_files\n",
    "\n",
    "import autokeras as ak\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "To make this tutorial easy to follow, we just treat IMDB dataset as a\n",
    "regression dataset. It means we will treat prediction targets of IMDB dataset,\n",
    "which are 0s and 1s as numerical values, so that they can be directly used as\n",
    "the regression targets.\n",
    "\n",
    "## A Simple Example\n",
    "The first step is to prepare your data. Here we use the [IMDB\n",
    "dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification)\n",
    "as an example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "\n",
    "dataset = tf.keras.utils.get_file(\n",
    "    fname=\"aclImdb.tar.gz\",\n",
    "    origin=\"http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\",\n",
    "    extract=True,\n",
    ")\n",
    "\n",
    "# set path to dataset\n",
    "IMDB_DATADIR = os.path.join(os.path.dirname(dataset), \"aclImdb\")\n",
    "\n",
    "classes = [\"pos\", \"neg\"]\n",
    "train_data = load_files(\n",
    "    os.path.join(IMDB_DATADIR, \"train\"), shuffle=True, categories=classes\n",
    ")\n",
    "test_data = load_files(\n",
    "    os.path.join(IMDB_DATADIR, \"test\"), shuffle=False, categories=classes\n",
    ")\n",
    "\n",
    "x_train = np.array(train_data.data)\n",
    "y_train = np.array(train_data.target)\n",
    "x_test = np.array(test_data.data)\n",
    "y_test = np.array(test_data.target)\n",
    "\n",
    "print(x_train.shape)  # (25000,)\n",
    "print(y_train.shape)  # (25000, 1)\n",
    "print(x_train[0][:50])  # <START> this film was just brilliant casting <UNK>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The second step is to run the [TextRegressor](/text_regressor).  As a quick\n",
    "demo, we set epochs to 2.  You can also leave the epochs unspecified for an\n",
    "adaptive number of epochs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "\n",
    "# Initialize the text regressor.\n",
    "reg = ak.TextRegressor(overwrite=True, max_trials=1)  # It tries 10 different models.\n",
    "# Feed the text regressor with training data.\n",
    "reg.fit(x_train, y_train, epochs=2)\n",
    "# Predict with the best model.\n",
    "predicted_y = reg.predict(x_test)\n",
    "# Evaluate the best model with testing data.\n",
    "print(reg.evaluate(x_test, y_test))\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Validation Data\n",
    "By default, AutoKeras use the last 20% of training data as validation data.  As\n",
    "shown in the example below, you can use `validation_split` to specify the\n",
    "percentage.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "reg.fit(\n",
    "    x_train,\n",
    "    y_train,\n",
    "    # Split the training data and use the last 15% as validation data.\n",
    "    validation_split=0.15,\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "You can also use your own validation set instead of splitting it from the\n",
    "training data with `validation_data`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "split = 5000\n",
    "x_val = x_train[split:]\n",
    "y_val = y_train[split:]\n",
    "x_train = x_train[:split]\n",
    "y_train = y_train[:split]\n",
    "reg.fit(\n",
    "    x_train,\n",
    "    y_train,\n",
    "    epochs=2,\n",
    "    # Use your own validation set.\n",
    "    validation_data=(x_val, y_val),\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Customized Search Space\n",
    "For advanced users, you may customize your search space by using\n",
    "[AutoModel](/auto_model/#automodel-class) instead of\n",
    "[TextRegressor](/text_regressor). You can configure the\n",
    "[TextBlock](/block/#textblock-class) for some high-level configurations, e.g.,\n",
    "`vectorizer` for the type of text vectorization method to use.  You can use\n",
    "'sequence', which uses [TextToInteSequence](/block/#texttointsequence-class) to\n",
    "convert the words to integers and use [Embedding](/block/#embedding-class) for\n",
    "embedding the integer sequences, or you can use 'ngram', which uses\n",
    "[TextToNgramVector](/block/#texttongramvector-class) to vectorize the\n",
    "sentences.  You can also do not specify these arguments, which would leave the\n",
    "different choices to be tuned automatically.  See the following example for\n",
    "detail.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "\n",
    "input_node = ak.TextInput()\n",
    "output_node = ak.TextBlock(block_type=\"ngram\")(input_node)\n",
    "output_node = ak.RegressionHead()(output_node)\n",
    "reg = ak.AutoModel(\n",
    "    inputs=input_node, outputs=output_node, overwrite=True, max_trials=1\n",
    ")\n",
    "reg.fit(x_train, y_train, epochs=2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The usage of [AutoModel](/auto_model/#automodel-class) is similar to the\n",
    "[functional API](https://www.tensorflow.org/guide/keras/functional) of Keras.\n",
    "Basically, you are building a graph, whose edges are blocks and the nodes are\n",
    "intermediate outputs of blocks.  To add an edge from `input_node` to\n",
    "`output_node` with `output_node = ak.[some_block]([block_args])(input_node)`.\n",
    "\n",
    "You can even also use more fine grained blocks to customize the search space\n",
    "even further. See the following example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "\n",
    "input_node = ak.TextInput()\n",
    "output_node = ak.TextToIntSequence()(input_node)\n",
    "output_node = ak.Embedding()(output_node)\n",
    "# Use separable Conv layers in Keras.\n",
    "output_node = ak.ConvBlock(separable=True)(output_node)\n",
    "output_node = ak.RegressionHead()(output_node)\n",
    "reg = ak.AutoModel(\n",
    "    inputs=input_node, outputs=output_node, overwrite=True, max_trials=1\n",
    ")\n",
    "reg.fit(x_train, y_train, epochs=2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Data Format\n",
    "The AutoKeras TextRegressor is quite flexible for the data format.\n",
    "\n",
    "For the text, the input data should be one-dimensional For the regression\n",
    "targets, it should be a vector of numerical values.  AutoKeras accepts\n",
    "numpy.ndarray.\n",
    "\n",
    "We also support using [tf.data.Dataset](\n",
    "https://www.tensorflow.org/api_docs/python/tf/data/Dataset?version=stable)\n",
    "format for the training data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "\n",
    "train_set = tf.data.Dataset.from_tensor_slices(((x_train,), (y_train,))).batch(32)\n",
    "test_set = tf.data.Dataset.from_tensor_slices(((x_test,), (y_test,))).batch(32)\n",
    "\n",
    "reg = ak.TextRegressor(overwrite=True, max_trials=2)\n",
    "# Feed the tensorflow Dataset to the regressor.\n",
    "reg.fit(train_set, epochs=2)\n",
    "# Predict with the best model.\n",
    "predicted_y = reg.predict(test_set)\n",
    "# Evaluate the best model with testing data.\n",
    "print(reg.evaluate(test_set))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Reference\n",
    "[TextRegressor](/text_regressor),\n",
    "[AutoModel](/auto_model/#automodel-class),\n",
    "[TextBlock](/block/#textblock-class),\n",
    "[TextToInteSequence](/block/#texttointsequence-class),\n",
    "[Embedding](/block/#embedding-class),\n",
    "[TextToNgramVector](/block/#texttongramvector-class),\n",
    "[ConvBlock](/block/#convblock-class),\n",
    "[TextInput](/node/#textinput-class),\n",
    "[RegressionHead](/block/#regressionhead-class).\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "text_regression",
   "private_outputs": false,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}